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Abstract 

We aim at the construction of a Hidden Markov Model (HMM) of as- 
signed complexity (number of states of the underlying Markov chain) 
which best approximates, in KuUback-Leibler divergence rate, a given 
stationary process. We establish, under mild conditions, the existence 
of the divergence rate between a stationary process and an HMM. 
Since in general there is no analytic expression available for this di- 
vergence rate, we approximate it with a properly defined, and easily 
computable, divergence between Hankel matrices, which we use as our 
approximation criterion. We propose a three-step algorithm, based 
on the Nonnegative Matrix Factorization technique, which realizes an 
HMM optimal with respect to the defined approximation criterion. A 
full theoretical analysis of the algorithm is given in the special case of 
Markov approximation. 



1 Introduction 



Let {Yt,t € Z} be a stationary finitely valued stochastic process that admits 
a representation of the form Yt = f{Xt) where {Xt,t € Z} is a finite Markov 
chain and / is a many-to-one function. We call such a process a Hidden 
Markov Model (HMM). Other definitions of HMM's have been proposed 
in the literature (and we will adopt a specific one, taken from |20l I21j . in 
subsequent sections of the present paper), but they are all equivalent to the 
present one which has the advantage of simplicity and serves well for an 
introductory section. The cardinality of the state space of the Markov chain 
Xt is called size of the HMM. 

The probabilistic characterization of HMM's was first given by Heller [13j in 
1965. The problem analyzed was: among all finitely valued stationary pro- 
cesses Yt, characterize those that admit an HMM representation. To some 
extent the results in [I3j are not quite satisfactory, since the proofs are 
non-constructive. Even if Yt is known to be representable as an HMM, no 
algorithm has been devised to produce a realization i.e. to construct, from 
the laws of It, a Markov chain Xt and a function / such that Yt ~ f{Xt) 
(i.e. they have the same laws). As stated, the problem has attracted the 
attention of workers in the area of Stochastic Realization Theory, starting 
with Picci and Van Schuppen pT|, see also Anderson [2|. More recent refer- 
ences with related results are Vidyasagar [24j and Vanluyten, Willems and 
De Moor [23j. While some of the issues have been clarified a constructive 
algorithm is still missing. 

In this paper we direct our attention to the approximation of stationary 
processes by HMM's and propose a constructive algorithm, based on Non- 
negative Matrix Factorization (NMF), that results in the approximate re- 
alization of a best HMM. More specifically, given a stationary process Yt, 
we consider the problem of optimal approximation of Yj within the class of 
HMM's of assigned size. The optimality criterion we adopt is the informa- 
tional divergence rate between processes. The optimal HMM exists, but is 
not unique. We construct an approximate realization of an optimal HMM 
by recasting the problem as a NMF with constraints, for which we devise 
a three step algorithm. A remarkable feature of the proposed algorithm is 
that, in the case of Markov approximation, it produces the explicitly com- 
putable optimal solution. In the special case of Yt being itself an HMM, the 
algorithm can be used to construct an approximate realization. 

In [15] numerical procedures for NMF have been proposed and convergence 
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properties of some of them have been studied in [10]; they turn out to be 
very close to those of the EM algorithm [25j, although the algorithm for 
NMF is completely deterministic. 

The remainder of the paper is organized as follows. Section 2 contains pre- 
liminaries on HMMs. In Section 3 the realization problem is posed, as well 
as an approximate version of it in terms of divergence rate. Section 4 estab- 
lishes the existence of the divergence rate between a stationary process and 
an HMM. In Section 5 the Hankel matrix of finite dimensional distributions 
is introduced, whereas in Section 6 we show its relevance for the approxi- 
mate realization problem. Finally, in Section 7, we propose the algorithm 
to find the best approximation and verify its ideal behavior in the case of 
approximation by a Markov chain. If the given process is an HMM itself 
of the same size as the approximating HMM, then we also show that the 
algorithm produces an HMM that is equivalent to the given one. 

This paper develops and extends some preliminary ideas presented in [9]. 

2 Mathematical Preliminaries on HMM's 

In this paper we consider discrete time Hidden Markov Models (HMM) with 
values in a finite set. We follow [201 [21], see also [2], for the basic definitions 
and notations. 

Let (It)jgz be a discrete time stationary stochastic process defined on a 
given probability space {0, A, P} and with values in the finite set (alphabet) 
3^ = {yi, 1/2, ■ ■ ■ ,yrn}- y* will denote the set of finite strings of symbols from 
the alphabet 3^, with the addition of the empty string denoted 0. For any 
V € y* , let \v\ be the length of the string v. By convention |0| = 0. If 
u,v (z y* , we denote by uv the string obtained by concatenation of v to u. 

For any n G N, let 3^" be the set of all strings of length n, with the obvious 
inclusion C y* . We denote by Y^^ = {l^+i, 1^+2, • • •} the (strict) future 
of the process Yt after t and by Yf = {..., Yt-i,Yt} the past of the process 
Yt before and up to t. The event {Ys+i, ... ,Yt = v} is represented by Y^ = v, 
for any v £ y* with \v\ = t — s. By convention {Y^^ = 0} = {Yf' = 0} = i7. 
For any v G y* we use {Y^^ = v} as a shorthand notation for the event 

{y/+H = 

Since {Ij} is stationary, the probability distribution of the sequence Y^^ is 
independent of t. This distribution induces a map p : 3^* — > [0, 1] with the 
following properties 
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(a) p{v) = P{Y+ = v) G y* 

(6) p{0) = 1 

(c) < p{v) < 1 yvey* 

id) Ever^ ?'(™) = ^(^) ^ ^* ^ ^• 

The map p represents the finite dimensional probabihty distributions of 
the process {Yt)t<^z, sometimes referred to as pdf. Notice that the special 
case of (d), when u = 0, provides for all n € N the standard property of 
a probability measure on y^: X^^g^n p(f ) = 1- The following definition 
basically originates with [5], where actually a control setting is considered. 
We adopt the formulation from |20j . 

Definition 2.1. A pair {Xt,Yt)tez of stochastic processes taking values in 
the finite set X xy is said to be a stationary finite stochastic system (SFSS) 
if 

i) {Xi,,Yt) is jointly stationary. 

ii) For allt £Z, a £ X*, V £y* it holds that 

P{X+ = a,Y+ = v\X^,Yf) = P{Xt = a,Y+ = v\Xt). (1) 

The processes {Xt)t&'L and {Yt)t<^i are called respectively the state and the 
output of the SFSS. 

Definition 2.2. A stochastic process iYt)tez with values in 3^ is a Hidden 
Markov Model (HMM) if it has the same distribution as the output of a 
SFSS. 

From the splitting property ([I]) it follows immediately that 

1. {Xt,Yt)te'E is a Markov chain. 

2. {Xt)t(^z is a Markov chain. 

3. The past and the future of Yj are conditionally independent given Xt^ 
i.e. for allteZ and v e y* 

P{Y+ = v\Xt, Yf) = P{Y^+ = v\Xt). (2) 

The representation of an HMM as the output process of a SFSS is not unique. 
The cardinality of X is called size of the representation of the HMM. The 
smallest size of a representation is called order of the HMM. In this paper 
we assume that the cardinality of ^ is and that of y is m. 
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Remark 2.3. The probability distribution of a stationary HMM is specified 
by 

• the m nonnegative matrices {M{y) , y € y} of size NxN with elements 

rriijiy) = P{Yt+i = y, Xt+i =j\Xt = i). (3) 

• a probability (row) vector vr of size N , such that vr = vr^, where 

A:=Y,M{y). 

y 

The matrix A is the transition matrix of the Markov chain {Xt)t^j^ and vr 
is an invariant vector of A. Since the state space X is finite, the Markov 
chain {Xt)t^'^ admits an invariant distribution, see [TO], which is unique if 
A is irreducible. 

We extend the definition in ^ to strings v ^y* as follows. 

Definition 2.4. Let v he a, string in y* of arbitrary length, k say. Then 
M(?j) e M^^^ is defined by 

m,j{v) = P{Yt'' = v,Xt+k = j\Xt = i). 

An immediate consequence of Definition 12.11 is that the following semigroup 
property holds 

M{uv) = M{u)M{v) Vn, v e 3^*. 

Let w E 3^" be given by = yi • • • y„. The map p then satisfies p{w) = 
PiYi = yi, . . . ,Yn = yn) and can be written in terms of the matrices M{yi) 
as 

p{w)=TrM{yi)---M{yn)e, (4) 
and for any pair of strings u and v in 3^*, one has 

p(uv) = TTM{u)M{v)e, (5) 

where e = (1, . . . , l)""". 

In the case of an HMM which has a representation of size A^, its finite 
dimensional distributions are completely determined by the values of p{u), 
for all strings u of length at most equal to 2N, see [8j for an easy proof of 
this statement, or [5j for more involved arguments leading to a proof that 
in fact lengths of at most 2N — 1 suffice. 
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Under the slightly restrictive factorization hypothesis: 

P(Yt+i = y,Xt+i=j\Xt = i) 

= P{Yt+i = y I Xt+i = j)P{Xt+i =j\Xt = i), Vt, y, i,j 

it is possible to reparametrize the pdf. 
Define 

hy := P{Yt = y\Xt = i) 
By := diag{biy,b2y,- ■ -bNy}- 

The factorization hypothesis then reads 

M{y) = ABy, 

from which one derives the classical Baum formula, see [3], 

p{w) = TTABy, ■■■ABy^e, 

which is the most widely used definition of HMM in the signal processing 
literature, see [22] , 

If y = f{X), a deterministic function of X, then biy E {0,1} with 
biy = 1 iff /(i) = y and the factorization hypothesis holds. Since it is always 
possible to represent an HMM as a deterministic function of a MC, one may 
assume without loss of generality the factorization hypothesis. In general 
this results in an unnecessarily large state space. In the present paper this 
additional assumption, however, is irrelevant. 

3 Realization for HMMs 

First we recall the weak stochastic realization problem |20] for HMMs, which 
is as follows. Let Y be an HMM with law Py(-), find an SFSS {X,Y) such 
that the law Py(-) coincides with -Py(-)- Any such SFSS is called a (weak) 
realization of Y. Since the laws Py{-) and -Py(-) are completely specified 
by the corresponding finite dimensional distributions py(-) and Py(-)i the 
problem reduces to finding matrices M{y) that specify the distribution of 
the SFSS (X,Y), see Remark 12. 3i The realization is inherently non-unique. 

In order to solve this problem one needs a characterization of the dis- 
tribution of an HMM. This characterization is given by Heller [13] (Theo- 
rem [3T] below). In the formulation of the theorem we need some additional 
concepts. Let C* be the convex set of probability distributions on 3^*. A 
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convex subset C C C* is polyhedral stable if (i) C = conv {qi{-),-" jQci')}, 
the convex hull of finitely many probabilities qi{-) and (ii) for 1 < i < c and 
yy & y the finite dimensional conditional distributions qi{- \ y) := £ C- 

Theorem 3.1 (Heller). Py(-) is the distribution of an HMM iff the set 
Cy '■= conv{py(- \ u) : u ^ y*} is contained in a polyhedral stable subset of 
C*. 

The realization problem is unsolved in general and Heller's theorem, al- 
though it gives a complete characterization, is not useful to find a concrete 
realization, that is finding the matrices M[y). For partial results we refer 
to [2] and [24]. 

In the present paper we propose to look for an approximate realization. The 
advantage of this alternative approach is that it can also be used as a proce- 
dure to approximate any given stationary distribution by that of an HMM. 
We formulate this approximate realization problem as a problem of optimal 
approximation in divergence rate, to be defined in the next section. 

Problem 3.2. Given Q, a stationary measure on 3^°°, and € N, find the 
distribution of a stationary HMM measure of size N , P* say, that is closest 
to Q in divergence rate, i.e. solve 

D{Q\\P*)=inlD{Q\\P\ (6) 
where the infimum is taken over all stationary HMM distributions of size A^. 

4 Divergence rate, existence and minimization 

In this section we recall the definition of the divergence rate between pro- 
cesses, as previously given in for instance [14] for two HMMs, and we show, 
under a technical condition, that the divergence rate between a stationary 
process and an HMM is well defined. 

Consider a process Y = {Yt)t&'L with values in y under two probability 
measures P and Q. We interpret P and Q as the laws of the process in 
the path space y°° . Let p(yo, ■■■,yk) = -P(>o = Vq, ■ ■ ■ ,Yk = Vk) and g(-) 
likewise. Recall the following fact. For varying arguments (together with 
their length), the functions p, g : 3^* — > [0, 1] represent the finite dimensional 
distributions of Y under each of the measures P and Q. For reasons of 
brevity, we write piY^) for the likelihood piY^, . . . , Y^) and likewise we also 
write qiY^). 
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Definition 4.1. Let Q and P be measures on 3^°° with q and p as the corre- 
sponding famihes of finite dimensional distributions. Define the divergence 
rate of Q with respect to P as 



if the limit exists. 

In the next theorem we establish the existence of the divergence rate between 
a stationary process and a stationary HMM under some restrictions. The 
approach we follow for the proof is inspired by analogous results in [16] 
and [H], although the arguments given in [IT], where the divergence rate 
between two HMMs is studied, could also be adapted. In the proof we use 
the following notation. If i? is a set of real numbers, then min^ R denotes 
the minimum of the strictly positive elements of i?, if it exists, which is of 
course the case when R is finite and contains at least one positive number. 

Theorem 4.2. Let Y he a process with values in y . Let Q be an arbitrary 
stationary distribution of Y on and P a stationary HMM distribution 
on y°° . Assume that 

(i) the distributions of all finite segments [Yq, . . . , Yn-i) under Q are ab- 
solutely continuous with respect to those under P. 

(a) Q admits an invariant probability measure fi* on y i.e. 



D{Q\\P) := lim -Eq log 




(7) 



l^*{y) = = y\Yo = 2/o)/^*(2/o). 



2/0 



(Hi) 



{Yt)teZ is geometrically ergodic under Q i.e. 3p S (0, 1) 



\Q(Yn = y\Yo = yo) - Q{Yn = y\Yo = y'o)\ = 0(/>") Vy,yo,yo G 3^- 



Then the limit in ^ exists and is finite. 

In order to prove Theorem 14.21 we need a technical lemma. 

Lemma 4.3. Under the assumptions of Theorem \4.S\ there exists a constant 
c G ( — 00,0) such that 



n^oo n 



lim — logpfyn" ^) = c a.s. with respect to Q. (8) 
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Proof. This Lemma represents a special case of Proposition 4.3 of [18] . 
Assumption A of [18] is replaced with our assumptions (ii) and (in). As- 
sumption B of [T8] plays no role in the present context. Assumption C of [T8] 
can be dispensed with, since the alphabet is finite. □ 

Proof of Theorem 14.21 

From the definition of divergence rate in formula d?]) we see that we have to 
establish the existence of the limit, as n tends to infinity, of 

^EQlogq{Y,^-') - ^EglogpiY^^-'). (9) 

For the first term in ([9]) we note that —Eq [log q(YQ~^^] is the entropy of 
q{Y^-^) and therefore -^EQlogq{Yf^~^) converges to H{Q), the entropy 
rate of Q, which is finite, because of stationarity and the fact that y is finite, 
see [111 Lemma 2.4.1]. Therefore it is sufficient to show that the second term 
in ([9]) has a finite limit, for which we use Lemma 14.31 Let yo,. . . ,yn-i be 
a string in 3^* with positive probability under Q. By absolute continuity, 
assumption (i), it also has positive P-probability. Now we exploit the fact 
that Y is an HMM under P. In particular, it follows from ^ that there are 
indices io, • • • , in-i such that 

TTiomoh (yo) • • • (yn-i) > 0. 

Since the set R of all probabilities vr^ and mij{y) is finite, we have 6 := 
min^ R > 0. Hence, we conclude from the above displayed inequality that 
P(.yo~ ) — ^"'^^^ from which we obtain that 

piY^-^) > Q-a.s. 

So 

^^5^ log 5 < - logpiYr^-^) < Q-a.s. 
n n 

Moreover, by Lemma 14.31 

lim - logpfyn""^) = c Q-a.s. 

n—foo n 

Then the dominated convergence theorem can be applied to conclude that 
^EQ\ogp{YQ~^) admits the finite limit c. □ 

Remark 4.4. It is possible to show a uniform version of Theorem 14.21 
the uniform convergence of the divergence rate with respect to P, under 
more stringent conditions on the approximating model class. For details on 
a closely related problem we refer to [18j, in particular Theorem 4.4. 
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A priori no extra information is available about the given stationary measure 
Q. Therefore it is useful to give conditions on the parameters mij{y) of the 
HMM measure P to ensure the absolute continuity condition of Theorem l4.2l 
for any given stationary measure Q. If Q is arbitrary, then in principle all 
probabilities q{yQ~^) can be strictly positive, therefore we give a sufficient 
condition that entails that all p{yQ~^) are positive. We formulate this as a 
corollary. 

Corollary 4.5. Let Q and P be as in Theorem \4-.^ with (i) replaced by 

(0 J^m,,(y) =P(yt+i =y|Xj =i) >0, Vy G 3^, Vi G ^. (10) 

j 

Then all finite strings have positive probability under P and hence the limit 
in ^ exists. 

Proof. Let 5' = minP(yfc = = i) > 0, which is strictly positive by 

the hypothesis. Then, for any y & y: 

P{Yk = y\ Y,'-') = P{Yk = y, X,_, = i\ Y^') 

i 

= P{Yk = y\ Xk-i = i) P{Xk-i = i\ Y^-^) 

i 

>5'YnXk-i=i\Yt^) = 5'. 

i 

By iteration of this inequality applied to p{yQ~^) = p{yo) YYkZi p{yk\yo~^) , 
the result follows, since ([10]) also implies that p{yo) = J2ij '^i'<^ij{yo) > 0. □ 

Remark 4.6. The Condition (jlOp of Corollarv 14.51 mav appear restrictive, 
but in absence of any additional knowledge about Q, one can not completely 
avoid it. To illustrate this, let us assume that P is such that Y is Markov. 
Since in principle all strings j/g"^ may have positive Q-probability, the same 
must hold under P, but this means that all transitions i ^ j have positive 
probability, so Aij > for all This is precisely Condition (|10|) in the 
present context. 

We return to Problem 13.21 This problem is well defined under the condi- 
tions of Theorem 14. 2t since the divergence rate is then guaranteed to exist. 
There is however a major problem. No analytic expression is known for the 
divergence rate, when Q is arbitrary and P an HMM measure (except for a 
Markov law P, that we will treat in Remark 14. 7|) . This is even the case if 
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Q itself is an HMM measure, see [12] for some recent results. A similar ob- 
servation has already been made in where the entropy rate of an HMM 
was studied for the first time. In fact, in the latter paper, the only non- 
trivial example for the entropy rate is given in the form of an infinite series 
example. This motivates an alternative approach. In the next section we 
will approximate the abstract Problem 13.21 with a, in principle, numerically 
tractable one. For this we will need the Hankel matrix involving all finite 
dimensional distributions of a stationary process and that of an HMM. This 
is the topic of the next section. 

Remark 4.7. The minimization problem can be solved explicitly if P runs 
through the set of all stationary Markov distributions. First we recall that 
the existence of the divergence rate when P is the distribution of a Markov 
process (or A:-step Markov process) with transition matrix A is much easier 
to establish. Inspection of the proof of Theorem 14.21 reveals that the entropy 
term —H{Q) remains. We now explicitly compute the second term in 
Since P is a Markov law, we have p{Yq~^) = piYo)W^^^p{Yj\Yj^i). But 
then 

n 

Eq \ogp{Y^-^) = Eq logp(yo) +Y.^Q ^ogp{Yj\Y,^i), 
and EQ\ogp{Yj\Yj^i) = £'q logp(Yi[yo)i by stationarity. Hence 

^EQ\ogp{Y^-^) ^ i^Q iogp(yi|yo). 

To guarantee that the latter expectation is finite for arbitrary Q, one imposes 
that all elements of A are positive, see Remark 14.61 This condition can be 
relaxed if it is known that for certain pairs y^^yi it holds that q{y oUi) = 0, 
in which case Ay^y^ = is allowed as well. 

A relatively simple computation shows that the minimizing distribution 
P* in this case is such that the transition probabilities = j\Yt = i) of 

the approximating Markov chain coincide with the conditional probabilities 
Q{Yt^i = j\Yt = i) and the invariant (marginal) distribution under P* is 
the same as the one under Q. Moreover, in this case it is easy to show that 
even the Pythagorean identity [6] 

D{Q\\P) - D{Q\\P*) = D{P*\\P) 

holds true. A similar result holds for approximation by a k-step Markov 
chain. Unfortunately, such appealing closed form solutions do not exist if 
the minimization is carried out over stationary HMM measures. 
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5 Hankel matrix for stationary processes 



Given an integer n, we define two different orders on y^: the first lexico- 
graphical order [flo) and tlie last lexicographical order (llo). These orders 
have been introduced in [2] . In the flo the strings are ordered lexicographi- 
cahy reading from right to left. In the llo the strings are ordered lexicograph- 
ically reading from left to right (the ordinary lexicographical ordering). Let 
us first give an example. Let the output alphabet be 3^ = {0, 1} and n = 2. 
Then we have (in flo) that yj^^ = (00, 10, 01, 11) and yf^^ = (00, 01, 10, 11). 

On 3^* we define two enumerations: {ua)flo and {vp)uo- In both cases 
the first element of the enumeration is the empty string. For {ua)flo we then 
proceed with the ordering of y^ according to flo, then with the ordering 
of y"^ according to flo, and so on. The enumeration {vp)iio is obtained by 
having the empty string followed by the ordering of according to llo, 
then by the ordering of y"^ according to llo, and so on. In both cases the 
length of a string increases monotonically with the index a or (3. In order 
to make clear the introduced notation, we continue with the example where 
the output alphabet is 3^ = {0, 1}. In this case the two enumerations will 
be: 

{ua)flo = (0, 0, 1, 00, 10, 01, 11, 000, 100, 010, 110, 001, 101, oil. 111, . . .) 
and 

{vfi)iio = (0, 0, 1, 00, 01, 10, 11, 000, 001, 010, oil, 100, 101, 110, 111, . . .). 
We are now able to give the following 

Definition 5.1. For a stationary process with pdf p(-) the Hankel matrix 
H is the infinite matrix with elements p (uaVp), where Ua and vp are respec- 
tively the a-th and /3-th elements of the two enumerations. 



As an example we write below the upper left corner of the Hankel matrix of 
a stationary binary process (again with y = {0, 1}). In the following table, 
this matrix results from deleting the first row and first column. 








1 


00 01 10 11 







1 


p(0) Pil) 


p(oo) p{oi) pilo) p{n) 






1 


p(0) 

p(l) 


p(00) p(01) 
p(10) p(ll) 


p(OOO) p(OOl) p(OlO) p(Oll) 
p(lOO) p(lOl) p(llO) p(lll) 




00 
10 
01 

11 


p(00) 
p{lO) 
p{01) 

p(lll) 


p(OOO) p(OOl) 
p(lOO) p(lOl) 
plow) p(Oll) 
p(llO) p(lll) 


]9(0000) p(OOOl) p(OOlO) p(OOll) 
p(lOOO) p(lOOl) p(lOlO) p(lOll) 
p(OlOO) p(OlOl) p(OllO) p(Olll) 
]3(1100) p(llOl) p(lllO) p(llll) 
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Fix integers K and L. All the following formulas hold MK > and VL > 0. 



Let ui , ti2 , • • • , % with 7 



m 



K 



be the enumeration according to the flo of 



the strings of length K. Similarly let f 1, U2, • • • , w<5 with 6 = be the 
enumeration according to the llo of the strings of length L. 

Let us denote by ^kl the {K, L) block of H of size x given by 
its elements p{uiVj) with i = 1, . . . , 7 and j = 1,. . . ,5. The H matrix can 
then be partitioned as 



H 



Hoo Hoi 
Hio Hii 



HoL 



Hj^o Hii"! • • • H 



KL 



As the reader can readily see, the antidiagonal blocks ^kl (with K + L 
constant) contain the same probabilities. With abuse of language H is 
called a (block) Hankel matrix although in a true block Hankel matrix YIkl 
is constant along the antidiagonals. 

Because of the columns enumeration scheme {v (3)110, the block Hk^l+i 
of size x m^~^^ can be written as 

llK,L+l=[iiKLiyi) ■■■ HKLiVm)] (H) 

where HKLiue) is defined as 

^Khivd = \p{uiyevj)]i=i^,„^^j=i^,„^s (12) 

The Hankel matrix of a stationary HMM has special properties which will 
be instrumental for our treatment of the approximation problem. For an 
HMM the elements p{uiVj) of ^kl can be factorized, according to ([5]), as 

p{uiVj) = TrM{ui)M{vj)e. 

In matrix form this gives the factorization property 

itM{ui) 

[ M{vi)e ■ ■ ■ M{vs)e 



H 



KL 



Defining 



n 



K 



ttM{ui) 
7rM(ti^) 



L : = 



M{vs)e 



(13) 



13 



matrices of dimensions x and x respectively, we obtain that 

YIkl = ^kTl (14) 



Remark 5.2. From relation (jl4p it follows that the Hankel matrix of a 
stationary HMM can be factorized as 



H 



Ho 
Hi 



n 



K 



[ To Ti 



where the infinite matrix [ To Ti • • • Tl 
that Rank{Yi) < N. 



has N rows. It follows 



In the case of K = and L = 0, ^ and ^ still hold and 
Hoo = noro= [ 7rM(0) ] [ M(0) e ] =7re = l 

where in the last passage we use that vr is a probability vector. 

Next we are going to rewrite formula (|lip . Observe that the probabilities 
in (1121) take the form 



piuiVeVj) = TTM{ui)M{yeVj)e. 
The matrices HKiiye) can be factorized as 



[ M{yevi) 



TTM{Ua) 

Thus formula (jlip can be expressed as 

IIk,l+i = [ UKTLivi) 
= Uk [ TLivi) 



Hence 



L+l 



UKTLiVr, 

rL(ym) 



(15) 
(16) 
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and 



^L{ye) = [ M{yevi)e ■■■ M(y^w^)e ] 

= M{yi) [ M{vi)e ■■■ M{v^)e ] 

= Miy,)TL . (17) 

Note that Ti^yi) has the same dimensions as F^,. 



6 Divergence rate approximation 

In this section we wiU see how to approximate the divergence rate D{Q\\P) 
between a stationary process and an HMM by the informational divergence 
between the corresponding Hankel matrices. 

For two nonnegative numbers q and p their informational divergence is de- 
fined as D{q\\p) = q log ^ — q + p with the conventions 0/0 = 0, OlogO = 
and g/O = oo for q > 0. From the inequahty xlogx > a; — 1 it fohows that 
D{q\\p) > with equahty iS q = p. 

Definition 6.1. Let M,N e M™''". The informational divergence of M 
relative to N is 

D{M\\N) = D{Mij\\Nij) = Y,{Mij log ^ - Mi, + Nij) (18) 

It follows that L»(M||N) > with equality iff M = N. If Ei j- = 
Tliij-^ij ~ informational divergence reduces to the usual Kullback- 

Leibler divergence between probability distributions 

Z)(M||N)=5]M,,log^. (19) 

The divergence rate between two processes can be approximated by the 
informational divergence between their Hankel matrices, as we will demon- 
strate now. 

Let Q and P be measures as in Theorem 14.21 Denote by H^n and H^^ the 
{n, n) block of their Hankel matrices. A typical element of H„„ is 

q^^'^^UiVj) := QiY^""-^ = UiVj) Vn^ e 3^" in flo,yVj S in llo 

Analogously a typical element of is 

p^^^'^UiVj) := P(yo2«~i = UiVj) E m floyvj G in llo 
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The informational divergence between the Hankel blocks is 



J2 q^''''Hu,v,)\og 



Er 



log 



(20) 



(21) 



which, when compared to the definition of divergence rate, provides the 
following 



Theorem 6.2. Assume that P and Q are as in Theorem \4-S\ Then the 
divergence rate exists and 



hm ^D{Ylnn\\Kn) = D{Q\\P). 



(22) 



This theorem motivates the use of ^-D(H„„||H^„), for n large enough, as 
an approximation of the divergence rate between Q and P. 



7 Algorithm for approximate realization 

We take as an approximation of the divergence rate between measures the 
informational divergence between the corresponding Hankel blocks. Indeed, 
Theorem 16.21 motivates, for n large, to replace Problem 13.21 bv 

minD(H„„||H^„). (23) 

By the HMMs block factorization property. Equation (|14p . it holds that 
Unn = n^F^. The minimization in ([25]) thus reduces to the Nonnegative 
Matrix Factorization (NMF) problem 

min i?(H„„|ln„r„) (24) 

under the constraints e'^Ylne = 1 and F^e = e. 

A minimizing nonnegative matrix always exists, see [10], Proposition 2.1. 
We seek a procedure for the construction of a parametric representation of 
an optimal HMM, starting from a solution (n*,F*) of the minimization 
problem ()24p . Two extra steps involving NMF will eventually produce the 
parameters M{y). We present the whole procedure as a three steps algo- 
rithm. At each step we provide some additional details and comments. 
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Algorithm 

1. Law approximation step 

Given: H„„ 

Problem: min D(H„„||n„r„) s.t. e^YlnG = 1 and F^e = e 

Solution: (n*„,r;) 

Here we consider problem (j24p . The minimization takes place under 
the constraints e^Il^e = 1 and r„e = e. A numerical procedure 
for carrying out the optimization problem has been proposed by Lee 
and Seung [15j and results concerning its asymptotic behaviour can 
be found in [lOj. The solutions U*^ and F* are of respective sizes 
(m" X N) and {N x m"). 

2. Approximate realization step 

Given: H„^„+i and U*^ from step 1 

Problem: mini:>(H„^„+i||n*„r„+i) s.t. T^+ie = e. 

Solution: r*_^]^ 

Here we consider the block Hn^n+i- Then the factorization (jl4p sug- 
gests the minimization of D(H„ „+i ||n*„r„_|_i) with respect to F^+i 
with n* fixed, obtained from step 1. The solution F*_,_;^ is of size 
(A^ X m""^^). The numerical procedure of [10] can be easily modified 
under the additional constraint imposed by specifying the H* matrix. 
In this case convergence takes place to a global minimum, which fol- 
lows from application of results by Csiszar and Tusnady [7j. 

3. Parametrization step 

Given: F* from step 1, F* , from step 2 and F^ x := F* = 



"P* 

n 






















n 



Problem: umiD (^F^+i ||MF^„) j s.t. Me = e 
Solution: M* = [M*(yi) . . . M*(y^)] 

The basis of this step is motivated by equations and ([T7D resulting 
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m 



r„+i = [ M(yi)r„ 

Defining the block matrices 



M{y. 



(25) 



M := [M(yi) . . . M{ym)] of dimension N x mN and 



(n) 





■•• 

r„ 



of dimension mN x m""*"^, 



we immediately obtain from ([25]) the identity r„+i = Mr(„-) . 
We denote by F^^^ the matrix obtained from r(„) by replacing the Tn 
with r* obtained from step 1. Let r*_,_j^ be the matrix obtained from 
step 2. Then (f25l) suggests to minimize D{T'^_^_l\\'MT*^^•^) with respect 
to M under the constraint Me = e. 

The minimization can be carried out with a factorization procedure 
similar to the one in step 2, leading to the solution M*. The submatri- 
ces M*{yi) with i = 1, . . . , m of dimension N x N are the parameters 
of the best HMM approximation. 



Notice that the constraint Me = e, imposed at step 3, corresponds to the 
requirement that the transition matrix of the underlying Markov chain A 
is stochastic and the resulting A* = Y2yi iVi) used as the transition 
matrix of the approximate model. 



The algorithm when the true distribution is an HMM 

Suppose that the stationary law Q that one wants to approximate is ac- 
tually that of a stationary HMM of order N. Then Equations (|14p . used 
to construct step 1 of the algorithm, ([15]) used for step 2, and ([T6]) and 
([17p . used for step 3 are valid for both Q and P and for the proper indices 
n,n + 1. The generic Hankel block of the Q measure therefore factorizes as 
H„„ = n„r„. In the (generic) full rank case, the matrices 11'^, F* resulting 
from step 1, will satisfy the relations 11'^ = Tl^S and ST* = r„, for some 
invertible matrix S, with the property that Se = e. It also follows that 
ST'^_^_i = Tn+i and one easily verifies that the matrices M*{yi) from step 3 
satisfy SM*{yi) = M{yi)S. Consequently SA* = AS and vr* = nS is an 
invariant vector of A*. The probabilities p*{u) = ■n*M*{u)e induced by the 
algorithm are therefore equal to the original probabilities p{u) = TTM{u)e. 
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The algorithm under Markov approximation 



Here we analyze the behavior of the algorithm in the case where one wants 
to approximate a given stationary process Y, having distribution Q, with 
a Markov chain having distribution P. We know from Remark 14.71 that 
the optimal divergence rate approximation P* is such that the transition 
probabilities P*(Yt+i = j\Yt = i) coincide with the conditional probabilities 
q{j\i) := QiYfj^i = j\Yt = i). We show that, in this case, the final outcome 
of the algorithm is in agreement with this result. 

Recall that the algorithm was motivated by the properties of the Hankel 
matrix of HMMs. When the approximating model class is Markov, we can 
still represent its elements as HMMs. Let {1, . . . , A^} be the space state of 
the Markov chain with transition matrix A, then the matrices M{y) assume 
the special structure 

mij{y) = Aij5jy. (26) 

The corresponding matrix XI^ consists of all row vectors 7rM(u), with u = 
2/1 ■ ■ ■ 2/n (in flo) of length n. The generic row takes the form of an iV-vector 
consisting of zeros and on the j-th place p(>^*^l"l = u) iff j = fjn- Write 
u = uyn, where u runs through all strings of length n — 1. It follows that 
n„ has the following block-diagonal structure, 



ui 
ul 








n 



N 



(27) 



where each block Tin is a column vector consisting of the probabilities 
P(l^* = uj). The Markov assumption does not impose any special struc- 
ture on the matrices . 

In step 1 of the algorithm we therefore impose that the matrix n„ has 
the block-diagonal structure (|27p . Write the matrix Tn as 



where the Tn are row vectors. Likewise we decompose the Hankel matrix 
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H 



H 



N 



The minimization Z)(H„„| |n„r„) under the constraint r„e = e reduces to 
the (decoupled) minimization problems L'(H:^„||nnr:^) with constraints 
Tne = e. These problems can be solved explicitly, since their inner size is 
equal to one. The solutions are 



n* 



H-' e 



and 



nn 



Stated in other terms, II*'' has typical elements q{uj) and has typical 



elements 



-^^QjY {v a string of length n). 
In step 2 of the algorithm something similar takes place. The solution 
^n+i typical elements '^^jyi where w is a string of length n + 1. 
In step 3 of the algorithm, the matrix M takes the form 



M = [M^••• ,M 



Nl 



where, by virtue of (j26p . AP = [0, • • • ,0,m^ ,0, ■ ■ ■ ,0], with the column 
vector on the j-th place. It turns out that also this step of the algorithm 
has an explicit solution, given by m* = q{j\i)- Hence the corresponding 
matrix of transition probabilities A* has elements A*j = q{j\i), in agreement 
with Remark 14.71 
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